Abstract

The focus of this study was on the relationship between visual 
and statistical analyses of time series data and the degree to which 
characteristics of the data influenced this relationship. A total of 
52 subjects took part in evaluating a series of graphs having
prespecified characteristics. The independent variables manipulated
included: slope, variability, training, and aimline/decision rules. 
Generally, the influence of these variables was significant and in the 
predicted direction. However, the overall level of relationship 
between the two analytic procedures was modest. The implications of 
this research are discussed in terms of the manner in which visual 
analysis is conducted and the procedures needed to establish 
statistical conclusion validity. 



Factors Influencing the Agreement Between Visual 
and Statistical Analyses of Time Series Data

For nearly 10 years now, there has been a controversy in the 
behavioral literature over the appropriate analysis of time series
data. Many arguments have been presented both for and against the use 
of statistics in such analyses. However, relatively few studies have 
been conducted to investigate the use of statistical analysis. In 
part, this has been due to the unique characteristics of time series 
data, which often make it difficult to use any procedures other than 
visual analysis. 

While classical statistical procedures developed from R. A.
Fisher (1951) have been discounted (cf. Journal of Applied Behavior
Analysis, v. 7, No. 4, 1974), several alternative procedures have been
proposed, including statistical analyses utilizing time series
analysis (Glass, Willson, & Gottman, 1975; Gottman, 1973; Gottman &
Glass, 1978), randomization tests (Edgington, 1967, 1972a, 1972b), the
Rn statistic (Revusky, 1967), and the c-statistic (Tryon, 1982). The
use of statistics in analyzing single subject data must address
several issues that are not relevant in the more traditional
between-group analysis. A central problem in the case of repeated
measurement over time is the degree to which successive data points
are related to each other. This characteristic usually is referred to
as serial dependency or autocorrelation, in which there is "a
correlation (r) between data points separated by different time
intervals (lags) in the series" (Kazdin, 1976, p. 273).
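The lag-based correlation described above can be illustrated with a short computation. The following is a minimal sketch in Python; the function name and the example series are hypothetical, and the estimator shown (lagged cross-products about the overall mean, divided by the total sum of squares) is one common convention, not necessarily the formula used by the authors cited here.

```python
def autocorrelation(series, lag=1):
    """Estimate the lag-k autocorrelation of a time series.

    Assumption: the common estimator that divides the lagged sum of
    cross-products about the overall mean by the total sum of squares.
    """
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    total_ss = sum((x - mean) ** 2 for x in series)
    if total_ss == 0:
        raise ValueError("series has no variability")
    cross = sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag))
    return cross / total_ss


# Hypothetical data: a steadily rising series is positively
# autocorrelated, while an alternating series is negatively so.
rising = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(rising))
print(autocorrelation(alternating))
```

A value near zero at every lag would be consistent with independent observations; values far from zero indicate the serial dependency discussed here.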

In statistics using an analysis of variance model (F, t), the
assumption of independence of error components is critical and, if
violated, precludes the use of such techniques as an appropriate
alternative. In time series data, there is considerable serial
dependency present and as a result the error components cannot be
assumed to be independent. In an empirical analysis, Kratochwill,
Alden, Demuth, Dawson, Panicucci, Arntson, McMurray, Hempstead, and
Levin (1974) demonstrated that the assumption of statistical
independence necessary for use of ANOVA is unwarranted in an N = 1
design. For example, given two behavioral outcomes (e.g., on-task and
off-task), the probability of any one occurrence is dependent upon 
previous occurrences, both in terms of the number of times the 
sequence changes from one outcome to another (from on-task to 
off-task) as well as the number of consecutive observations 
characterized by the same behavior (the length of strings of on-task 
and off-task occurrences). 

The presence of serial dependency has other effects on 
conventional analyses. First, the number of independent sources of 
information in the data is reduced, resulting in an overestimate of
the true value of F or t. Second, there is a spurious reduction in
the variability of the data, resulting in an underestimate of the
variability that would have been obtained from independent 
observations, the net effect of which is a positively biased F or t 
(Kazdin, 1976). Third, Type I error is underestimated for positive 
autocorrelation and overestimated for negative autocorrelation 
(Scheffe, 1959). The use of ANOVA procedures on time series data will 
also result in the magnitude of the mean square (MS) error being 
greatly increased and, as a consequence, the probability of detecting
a true treatment effect will be decreased. Error variance in this
model is calculated from the deviation of scores within conditions
from the condition mean, with no account taken of trend in the data.
When all data points are included in determining the mean, rather than
using only those obtained at the point at which asymptotic levels of
performance have been reached (the peak level of responding), the
problem cited above, increasing the magnitude of the MS error, is
exacerbated (Hartmann, 1974).

Gottman and Glass (1978), in arguing for the use of statistics, 
countered that the large effects that operant psychologists are 
accustomed to detecting are the unique phenomenon of the experimental 
laboratory. In applied settings, such as schools and hospitals, there 
is little control over many of the environmental stimuli, and as a 
consequence, "one has every reason to expect small effects outside the 
laboratory" (p. 199). Therefore, they argued that small effects
detectable by sensitive statistical procedures, but undetectable by 
visual analysis, are important to know about if further research and 
experimental manipulations are to be investigated. Kazdin (1975) 
agreed, noting that interventions that show only a modest effect alone 
may be important when added to other interventions. Other situations
in which some argue for the use of statistical procedures include: 
(a) instances where it is difficult to achieve a stable baseline and 
time or ethical constraints preclude further waiting; (b) when visual 
analysis is equivocal and the effects of changes are ambiguous; and 
(c) in applied settings characterized by uncontrolled variation 
(Kazdin, 1976). 



According to Glass et al. (1975), data from time series
experiments that do not appear statistically significant when visually
inspected often turn out to be significant when "appropriately
tested" (p. 62). When Jones, Weinrott, and Vaught (1975) reanalyzed 
the Hall, Fox, Willard, Goldsmith, Emerson, Owen, Davis, and Porcia 
(1971) study, they found that different conclusions might be reached 
in relying on visual and statistical criteria for evaluating change in 
trend. Here, changes in slope "appeared" to be significant but were
not when analyzed using statistical techniques. 

Gottman and Glass (1978) investigated the degree to which visual 
analysis and statistical analysis were in agreement. Thirteen 
graduate students in a seminar on time-series analysis were asked to 
inspect various graphs and judge whether or not an intervention effect 
was present. The same data were analyzed by methods outlined in 
detail by Glass et al. (1975), as a first-order integrated moving
averages process, and a t statistic was applied for each intervention. In
sum, they found the "eyeball test" to give results that varied from 
judge to judge and were in sharp conflict with the findings of 
statistical tests. 

Jones, Vaught, and Weinrott (1977) reanalyzed several
investigations reported in the Journal of Applied Behavior Analysis
(JABA) representing a great variety of scores and design properties,
using time-series analysis. In some instances, time-series analysis
corroborated the authors' visually based conclusions; in others the
two analyses yielded different conclusions; and in still others, the
time-series analysis revealed findings not apparent in the authors'
visual analysis.



Finally, an investigation was conducted by Jones, Weinrott, and
Vaught (1978) to ascertain the extent to which serial dependency
influenced the agreement between inferences based on visual or
time-series analysis. JABA graphs were presented to judges well versed
in behavior charting and they were asked whether or not a meaningful
change in level had been demonstrated from one phase to another.
Graphs were selected by the authors in which the effects were
sufficiently "nonobvious" to warrant critical analysis, and serial
dependency was apparent. The graphs were further blocked into three
different levels of serial dependency by two levels of significance of
difference in level between phases.

Their results indicated that agreement between visual analysis 
and time-series analysis was inversely related to the magnitude of the 
serial dependency in the scores. That is, the more serial dependency 
present, the less reliable visual analysis tended to be. Furthermore, 
they found that visual and time-series inferences agreed better when 
no statistical changes in level were present. Finally, an interaction 
effect was present in which visual and time-series inferences were
more in agreement when the data showed neither serial dependency nor
significant differences in level. In effect, judges tended to agree 
with time-series analysis that no effect was present but disagreed 
most when an effect was present. Intercorrelations among the 11 
judges ranged from .04 to .79 with a median of .39, suggesting fairly 
low consensus among judges and indicting the dependability of visual
inferences. However, there was no relationship found between the
reliability of the judges and the degree of agreement with time-series
inferences. Jones et al. (1978) considered their findings (low
agreement with high serial dependency and statistically reliable
changes in level, and high agreement with low serial dependency and
unreliable changes in level) to be contrary to the purpose of research
using an operant methodology.

The only investigation to appear in the literature raising doubt
about the validity of statistical analyses is that of Kazdin (1976).
He noted that although statistical evaluation may be more reliable for 
a given pattern of data, it is not possible to state that it is 
generally more reliable. He cited an investigation by Gottman (1973) 
in which data were clearly significant by visual analysis, yet with 
the application of statistical time-series analysis, no significant
effects appeared. One explanation given for this discrepancy was the
short duration of individual phases, making conclusions from the
time-series analysis equivocal.

It is clear from the research conducted to date that there are
problems in analyzing time-series data. While traditional statistical
methodology appears inappropriate, only a relatively few appropriate
statistical procedures for time series data have been developed.
However, the alternative, visual analysis, appears very problematic
(DeProspero & Cohen, 1979; Gottman & Glass, 1978; Jones et al., 1978;
Tindal, Deno, & Ysseldyke, 1983; Wampold & Furlong, 1981). Generally,
this research has found visual analysis to have low reliability and to
fluctuate quite dramatically with various characteristics of the data
array. In an effort to improve this form of analysis, several studies
have been conducted on the development of guidelines for visually
analyzing time series data. The major focus of this work has been on
the use of aimlines and decision rules (Bohannon, 1975; Deno, Chiang,
Tindal, & Blackburn, 1979; Liberty, 1972; Martin, 1980; Mirkin & Deno,
1979; White & Haring, 1980). Generally, for any given time series an
aimline is established which begins at the current level of
functioning and extends to the (goal) level of expected outcome.
Subsequently, decision rules are devised for making program changes,
typically when the data fall below the aimline for two or three
consecutive days. Ideally, the use of such aimline/decision rules
should provide systematicity to the interpretation of data, though no
research has been conducted on the effects of using
aimline/decision rules on the reliability of visual analysis.
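The aimline and decision-rule procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a procedure taken from the studies cited: the function names are hypothetical, the aimline is assumed to be a straight line from the current level to the goal level, and the rule is assumed to fire after three consecutive scores below the aimline.

```python
def aimline(start_level, goal_level, n_days):
    """Expected level for each day, rising linearly from start to goal."""
    if n_days < 2:
        raise ValueError("need at least two days to draw an aimline")
    step = (goal_level - start_level) / (n_days - 1)
    return [start_level + step * day for day in range(n_days)]


def first_program_change(scores, expected, run_length=3):
    """Return the index of the day on which the decision rule fires.

    Assumed rule: change the program once `run_length` consecutive
    scores fall below the aimline. Returns None if it never fires.
    """
    below = 0
    for day, (score, target) in enumerate(zip(scores, expected)):
        below = below + 1 if score < target else 0
        if below == run_length:
            return day
    return None


# Hypothetical scores over six days, with a goal level of 20.
expected = aimline(start_level=10, goal_level=20, n_days=6)
scores = [10, 13, 13, 14, 15, 21]
print(first_program_change(scores, expected))
```

Setting `run_length=2` gives the two-day variant of the rule mentioned in the text.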

In summary, it appears that two different strategies have been 
followed in attempting to develop an empirical basis for some formal
type of analytic procedures: (a) investigations in the field of
statistical analysis which focus on the development of appropriate and
sensitive techniques for analyzing time-series data, and (b)
investigations focusing on the problems of visual analysis and the
development of guidelines and procedures for ameliorating these
problems. The methodology of this research encompasses both
strategies. Using a technique recently reported by Tryon (1982), time
series data were analyzed statistically for determining treatment
(intervention) effects. At the same time, the degree to which these
results were in concord with visual analysis was investigated. Two
evaluation components within the visual analysis of data were
manipulated: the training of judges and the use of aimlines. This
study represents a reanalysis of previous research conducted by Tindal
et al. (1983).

Method

Subjects 

Subjects for this study were in-service and pre-service teachers
from three different locations around a large midwestern city. Two of
the sites were school districts, accounting for nine of the subjects,
all of whom were currently teaching. Teachers in these two sites were
assigned randomly to different treatment conditions, with the three
subjects from one district assigned to the experimental group and the
six from the other district assigned to the control group. The
remaining 42 subjects were students taking a required special
education class at a large midwestern university. Most of these
subjects were currently teaching or were former teachers. Subjects from
this pool were assigned randomly to treatment groups in proportion to
the number needed for bringing the experimental and control groups to 
approximately the same size. Twenty students were assigned to the 
control group and 28 were assigned to the experimental group. 

Training. The training of subjects involved both an in-service
workshop and a take-home training module. The teachers in the
experimental group were given training in the analysis of graphed data
for evaluating instructional programs. This entailed explanations and
exercises in summarizing student performance and using it to make
interpretations. Included in the summarization of time-series data
were computations of step changes, medians, slopes (using the
split-middle technique; White, 1971), variability (using total bounce;
Pennypacker, Koenig, & Lindsley, 1972), and overlap (Parsonson & Baer,
1978). The rest of the workshop was devoted to the use of this
information for evaluating instruction. The teachers in the control
group were given training in the development of measurement techniques
in the areas of reading, writing, and spelling. They were trained in
assessing students to determine performance discrepancies, sampling
curriculum materials to find an appropriate instructional level, and
developing a measurement system to monitor student improvement.

Both workshops lasted approximately 2½ hours. Following the
workshop, the experimental materials (graphs, response sheets, and 
directions) for 14 graphs were distributed. Following completion of 
these graphs (which ranged from one week for the subjects in the class
to three weeks for subjects in the schools), the second set of 14
graphs was distributed. The completion and return of this material
again took one week for the subjects in the class and three weeks for
those in the schools.

Experimental materials. A total of 28 different graphs were
constructed in which slope and variability were systematically
manipulated. Two phases were displayed in each graph: 11 data points
in the baseline and 15 data points in the intervention phase. A
vertical line was drawn separating the two phases. The aimline
represented a 30% improvement over the median of the last
three days during baseline. To ensure comparability between the
graphs with and without aimlines, the absolute level of this median
value was nearly the same across both aimline conditions within each
respective level of slope. Although the slope was manipulated only in
the intervention phase, variability was manipulated in both baseline
and during the intervention. A total of three levels of slope and
four conditions of variability were included in the graphs.

With variability manipulated in both baseline and intervention,
two different levels of variability were included: a bounce of
5 data points and one of 15 data points. For every level of
slope, variability increased (5-15), decreased (15-5), remained at the
same low level (5-5), or remained at the same high level (15-15). This
resulted in the following combinations of graphed data:

(a) Six graphs showed an increase in variability from
baseline to intervention from 5 data points to 15 data
points, with a concurrent increase in slope from 0 to
10 degrees for two graphs, an increase from 0 to 15
degrees in two graphs, and an increase from 0 to 20
degrees for the final two graphs. Of these six graphs,
three had an aimline drawn in during the intervention
phase, one from each combination of slope and variability.

(b) Six graphs showed a decrease in variability from 15
data points in baseline to 5 data points in the
intervention phase. For two of these graphs, the
change in slope from baseline to intervention involved
an increase from 0 to 10 degrees, two graphs
depicted an increase from 0 to 15 degrees, and two had
an increase from 0 to 20 degrees. Again, an aimline
was drawn in on half (three) of the above graphs, one
from each combination.

(c) Six graphs showed steady (unchanging) variability
at a low level (5 data points bounce) from baseline
to intervention. Again, the slope changed from 0
to 10 degrees on two of the graphs, 0 to 15 degrees
on two graphs, and 0 to 20 degrees on the final two
graphs. For each pair of slope-variability, one had
an aimline and one did not.

(d) Six graphs showed steady (unchanging) variability
at a high level (15 data points bounce) from baseline
to intervention. Each level of slope (10, 15, and 20
degrees) was represented; two graphs displayed a change
of 0 to 10 degrees, two graphs displayed a change of
0 to 15 degrees, and two showed a change of 0 to 20
degrees. Again, aimlines were present on half of
these, one in each combination of slope-variability.

The final four graphs which were constructed had the following
characteristics: 

(e) Four graphs that were given at time 1 were again 
given at time 2, with exactly the same data array 
depicted. All of these graphs displayed a low slope 
change (0 to 10 degrees) and constant variability 
(either the same low or same high variability). For 
each of the two variability conditions, one had an 
aimline present and one had no aimline present. 
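The graph counts in sets (a) through (e) can be checked by enumerating the design. The sketch below, in Python, is offered only as a bookkeeping aid; the variable names and dictionary layout are hypothetical, and the tuples of bounce values simply encode the (baseline, intervention) levels listed above.

```python
from itertools import product

slopes = (10, 15, 20)  # intervention slope, in degrees
# (baseline bounce, intervention bounce), in data points
variability = ((5, 15), (15, 5), (5, 5), (15, 15))
aimline_shown = (True, False)

# Sets (a) through (d): every slope x variability x aimline combination.
core = [
    {"slope": s, "bounce": v, "aimline": a}
    for s, v, a in product(slopes, variability, aimline_shown)
]

# Set (e): the low-slope, constant-variability graphs given a second time.
repeated = [
    g for g in core
    if g["slope"] == 10 and g["bounce"][0] == g["bounce"][1]
]

print(len(core), len(repeated), len(core) + len(repeated))
```

The enumeration yields 24 distinct graphs plus 4 repeated graphs, which matches the total of 28 reported in the text.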

Dependent variables. As noted previously, each subject was given
14 of the graphs following training. Each graph had a response sheet
containing several different questions. Responses to only the first
question, "Was the intervention depicted on the graph an effective
one?", were attended to in this investigation. Responses to this
question consisted of rating the effectiveness on a 1-4 scale, with 1
being definitely not effective and 4 being definitely effective.
After the first set of 14 graphs and responses were collected, another
set of 14 graphs was distributed. The order in which the graphs were
organized (and completed) was determined randomly for both groups of
subjects.

All subjects' responses to the original research question were
recoded as dichotomous responses signifying judgments of either the
presence or absence of an effective intervention. The four-point
scale therefore was reduced to a two-point scale, with ones
(definitely not effective) and twos (possibly effective) recoded as
zeros, while threes (moderately effective) and fours (definitely
effective) were recoded as ones. A zero represented the judgment of
the intervention as having no effect, while a one represented a
judgment of the intervention as having an effect. 
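The recoding rule can be stated compactly in code. The following Python sketch uses a hypothetical function name and hypothetical ratings; only the mapping itself (1 and 2 to zero, 3 and 4 to one) comes from the text.

```python
def recode(rating):
    """Collapse a 1-4 effectiveness rating into 0 (no effect) or 1 (effect)."""
    if rating not in (1, 2, 3, 4):
        raise ValueError("rating must be an integer from 1 to 4")
    return 0 if rating <= 2 else 1


# Hypothetical ratings from one judge across six graphs.
ratings = [1, 2, 3, 4, 2, 3]
print([recode(r) for r in ratings])
```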




All graphs were reanalyzed using the time-series analysis 
procedures described by Tryon (1982). This technique had been 
reported as particularly amenable for use on time series having a low 
number of data points. In this study, there were only 15 data points 
in the intervention phase. As Tryon (1982) noted, these statistical 
procedures may be used either within phases or between phases. His 
description for use between phases was, however, far less explicit,
and indeed, the phase change was ignored in the example analysis he 
provided in the report. Rather than incorporating baseline data with 
intervention data, as done in that report, the statistical analyses 
conducted here involved only intervention phase data. 
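For readers unfamiliar with the procedure, the c-statistic can be sketched as follows. The formulas in this Python example are the ones generally attributed to Tryon (1982) in the wider literature and are an assumption here, since they are not reproduced in this report: C = 1 - [sum of squared successive differences] / [2 x sum of squared deviations from the mean], with standard error sqrt((n - 2) / ((n - 1)(n + 1))) and z = C / SE. The function name and data are hypothetical.

```python
import math


def c_statistic(series):
    """Return (C, z) for one phase of a time series.

    Assumed formulas (not given in this report):
      C  = 1 - sum((x[i] - x[i+1])**2) / (2 * sum((x[i] - mean)**2))
      SE = sqrt((n - 2) / ((n - 1) * (n + 1)))
      z  = C / SE
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three data points")
    mean = sum(series) / n
    total_ss = sum((x - mean) ** 2 for x in series)
    if total_ss == 0:
        raise ValueError("series has no variability")
    diff_ss = sum((series[i] - series[i + 1]) ** 2 for i in range(n - 1))
    c = 1 - diff_ss / (2 * total_ss)
    se = math.sqrt((n - 2) / ((n - 1) * (n + 1)))
    return c, c / se


# Hypothetical intervention-phase data with a clear upward trend.
c, z = c_statistic([1, 2, 3, 4, 5, 6, 7, 8])
print(round(c, 3), round(z, 3))
```

Under these assumptions, z would be compared with a normal critical value to decide whether the phase shows a statistically reliable trend.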

With both visual analyses and statistical analyses organized into 
dichotomous variables indicating either significant or nonsignificant
intervention effects, the data were cross-tabulated using a chi-square
analysis. A comparison between visual and statistical analyses was 
conducted for all of the major independent variables including the 
effect of slope, variability, training in data utilization and the use 
of aimlines, as well as reliability over time. No interactions were 
investigated in this study (see Tindal, Deno, & Ysseldyke, 1983, for
interaction analyses).
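The cross-tabulation can be illustrated with a small Python sketch. It assumes the standard Pearson chi-square for a 2 x 2 table without a continuity correction (the report does not say whether a correction was applied), and it treats the statistical result as the criterion when counting misclassifications. All function names and data are hypothetical.

```python
def crosstab(visual, statistical):
    """Count judgments in a 2 x 2 table keyed by (visual, statistical)."""
    table = {(v, s): 0 for v in (0, 1) for s in (0, 1)}
    for v, s in zip(visual, statistical):
        table[(v, s)] += 1
    return table


def chi_square_2x2(table):
    """Pearson chi-square for a 2 x 2 table (no continuity correction)."""
    a, b = table[(1, 1)], table[(1, 0)]
    c, d = table[(0, 1)], table[(0, 0)]
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("a row or column is empty; use a one-sample test")
    return n * (a * d - b * c) ** 2 / margins


def percent_misclassified(table):
    """Percentage of judgments where visual and statistical results differ."""
    n = sum(table.values())
    return 100 * (table[(1, 0)] + table[(0, 1)]) / n


# Hypothetical judgments: 1 = effect, 0 = no effect.
visual = [1] * 30 + [1] * 10 + [0] * 10 + [0] * 30
statistical = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30
table = crosstab(visual, statistical)
print(chi_square_2x2(table), percent_misclassified(table))
```

The two summaries answer different questions, which is why a relationship can be statistically significant while the misclassification rate remains high.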

Results

The results of the visual and statistical analyses for all graphs 
are listed in Table 1. Included are the numbers of judges indicating 
whether the intervention was effective or not effective on the basis 
of visual analysis, for both trained and untrained judges, as well as 
both the c-statistic and the z-statistic, upon which the tests of
significance are based for time-series analysis. In the analyses
which follow, the relationship between visual and statistical analyses
is reported in two ways: first, whether the relationship is
significant using a chi-square analysis; and second, the percentage of
intervention effects that are misclassified by visual analysis, using
the results of the statistical test as the criterion. 

Insert Table 1 about here 


The relationship between visual and statistical analyses for
trained and untrained judges appears in Table 2. Although there was a
significant relationship between visual and statistical analyses for
trained judges, no such relationship occurred for untrained judges.
Nevertheless, there was only a small difference between the two groups
in the percentage of effects which were misclassified, with visual
analysis of trained judges only slightly more accurate than that of
untrained judges. The major difference appeared in the large number 
of effects that untrained judges viewed as significant, with nearly a 
third again as many as with trained judges. Many of these 
intervention effects were not statistically significant, resulting in 
the significant chi-square and more accurate classification for 
trained judges. Of the statistically significant effects, trained 
judges classified equal proportions as either significant or not,
while untrained judges viewed far more as significant.

Insert Table 2 about here

The influence of slope of improvement on the agreement between
visual and statistical analyses is summarized in Table 3. There was a
significant relationship between the results of visual and statistical
analyses when the slope was steep (20 degrees) or low (10 degrees),
but not when it was intermediate (15 degrees). In this latter
condition, the number of intervention effects that were misclassified
approached 50%. Although a low slope of 10 degrees resulted in
a significant relationship between the two types of analyses, the
percentage of misclassifications was the highest, at 56%. In
contrast, the percentage misclassified was quite low (35%) when the
slope was steep.


Insert Table 3 about here 



The types of errors made appeared to vary as a function of the 
level of the slope. With a steep slope, very few judges reported 
nonsignificant intervention effects for effects that were
statistically significant. Rather, most reported statistically
nonsignificant effects as appearing to be visually significant.
However, with a low slope, just the opposite occurred. More judges
rated statistically significant effects as visually nonapparent. With
a moderate level of slope (15 degrees), there was no difference in the
types of errors made in judgment.

Little differential effect was present in the relationship
between visual and statistical analyses with the use of aimlines
versus no aimlines (see Table 4). Both of these conditions resulted
in significant relationships between visual and statistical analyses.
The most pronounced difference occurred in the rates of
misclassification. Far fewer interventions were analyzed incorrectly
when no aimlines were depicted on the graphs (39%) than when aimlines
were present (60%). In this latter condition, not only were there
more of both types of errors, but also more judgments of significance
using visual analysis that were not significant statistically. The
lowest frequency cell with graphs having aimlines occurred with
nonsignificant judgments using both visual and statistical appraisals.
In contrast, when graphs had no aimlines, there were approximately
equal numbers of judgments of consistent decisions for both
significant and nonsignificant effects.



Insert Table 4 about here



Only one significant effect out of four conditions was evident
for the influence of variability on the agreement between visual and
statistical analyses, which occurred when variability decreased (see
Table 5). When variability increased, all intervention effects were
statistically nonsignificant, precluding the use of the two-sample
chi-square statistic. The opposite situation occurred when
variability decreased, with no graphs displaying statistically
nonsignificant effects. For each of these conditions, a one-sample
chi-square was used to analyze the degree of correspondence between
visual and statistical analysis. For increasing variability, no
significant relationship was found between the two types of analysis,
while for decreasing variability, a highly significant relationship
was found. In this latter condition, nearly twice as many
statistically significant interventions were deemed visually
significant as were judged visually nonsignificant. This resulted in
a low percentage (35%) of misclassified interventions in which
statistical and visual analyses disagreed.
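When every graph in a condition has the same statistical outcome, only the split of visual judgments remains to be tested. A minimal Python sketch of a one-sample chi-square follows; it assumes the expected frequencies are an even split across the categories, which the report does not state explicitly, and the function name and counts are hypothetical.

```python
def chi_square_one_sample(observed):
    """Goodness-of-fit chi-square against equal expected frequencies.

    Assumption: judgments are expected to divide evenly among categories.
    """
    n = sum(observed)
    k = len(observed)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and one observation")
    expected = n / k
    return sum((o - expected) ** 2 / expected for o in observed)


# Hypothetical counts: visually significant vs. visually nonsignificant.
print(chi_square_one_sample([40, 20]))  # roughly a 2-to-1 split
print(chi_square_one_sample([30, 30]))  # an even split
```

With two categories the statistic has one degree of freedom, so an even split produces a value of zero and a 2-to-1 split of this size exceeds the conventional .05 critical value of 3.84.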



Insert Table 5 about here 


The graphs depicting constant low variability across both pre- and
post-intervention phases showed no significant relationship for the
two analytic procedures. Finally, with constant high variability
depicted in both phases, the relationship between visual and
statistical analyses approached significance (p = .06). In both
conditions of constant variability, as well as in the condition with
increased variability, the percentage of misclassified intervention
effects was around 50%. 

The comparison of visual and statistical analyses in terms of the 
reliability from Time 1 to Time 2 was confined to slopes of 10 degrees 
and constant variability, either low or high (see Table 6). A large 
difference between the two conditions of variability was found in the
significance of the relationship between visual and statistical
analyses. With low variability, the relationship between the two
types of analyses was significant, while no such finding occurred for
high variability. Yet, little difference existed in the percentage
misclassified, with both above 50%. For both conditions, more judges
rated statistically significant effects as visually nonsignificant. 

Insert Table 6 about here 



Discussion 

The major findings of this research corroborate much of the 
previous research conducted in this area (Gottman & Glass, 1978; Jones 
et al., 1975, 1977, 1978). The agreement between visual and
statistical analyses is modest at best (under ideal conditions) and
otherwise quite low. In general, there was a high percentage of
intervention effects that were misclassified, using the results of the
statistical analysis as the criterion. That is, many interventions
were viewed as significant but were not statistically significant, and
vice versa (statistically significant effects were viewed as
nonsignificant). The lowest percentage of misclassifications occurred
under the quite ideal conditions of relatively steep slope (of 20
degrees) or decreasing variability. Under most of the conditions, the
percentage of misclassified intervention effects hovered around 50%.

Although there were several findings specific to the manipulated
variables that were consistent with established data-utilization
procedures (Parsonson & Baer, 1978), there also were several findings
in contrast with previous reports. For instance, with a steep slope
there were relatively few interventions that were viewed as
nonsignificant. Only U% of the statistically significant
interventions were visually analyzed as nonsignificant. The problem
appeared to be one of "seeing" effects that were not there rather than
one of "not seeing" effects that were present. This finding is
somewhat in contrast with the results obtained by Jones et al. (1978),
in which judges were most in agreement with time series analysis when
the analyses showed no effect and least in agreement when the analyses
showed an effect. It is also in contrast to Baer's (1977) analysis of
behavioral (N = 1) research designs as possessing very low probabilities
of Type I errors and correspondingly high probabilities of Type II
errors. In this research, there were several conditions in which the
subjects tended to visually analyze effects as significant, though
statistical analysis did not corroborate the same interpretation:
when the judges were untrained, when the slope was steep, when
aimlines were present, and when variability was constantly high.

Agreement between visual and statistical analyses for various
levels of slope revealed an interesting relationship. Only in the 
extreme levels (either low or high) was the relationship significant. 
Although visual judgment was more consistent with statistical analysis 
when the data array showed more pronounced change (or lack thereof), 
there still was a high percentage of misinterpreted effects. 
Apparently, extremely steep slopes or absolutely no slope must be
present before the relationship appears to be not only statistically
significant, but one in which the rates of misclassification also are
low.

The effect of variability on the relationship between visual and
statistical analyses generally was quite predictable. The worst case
occurred with increased variability. Here, the judges were split
evenly over the visual analysis of intervention effects, even though
all effects had been found to be statistically nonsignificant. In
contrast, when variability decreased, there was a 2 to 1 margin of
judges viewing the effects as visually significant, which was in
agreement with the statistical analysis. The effect of low constant
variability resulted in a nearly even split, similar to that found
with increasing variability. While most of the interventions depicted
in graphs with constant high variability were statistically
nonsignificant, the majority of judges viewed the effects as
significant. This finding is similar to the previous finding reported
by Tindal et al. (1983), in which some degree of variability is viewed
as indicative of treatment effects. 

The fact that a higher (more significant) relationship between
visual and statistical analyses was obtained for trained judges than
untrained judges is consistent with the predicted treatment effect.
Apparently, some training effect, albeit modest, occurred through the
use of a two-hour lecture and a take-home manual. Previous doubts
concerning the efficacy of training had been expressed by Tindal et
al. (1983). This doubt primarily was a result of the low reliability
obtained by both trained and untrained judges. Although judgments of
trained judges showed a higher relationship with statistical analysis,
it is also true that the levels of misclassification were quite high
and only slightly improved over that obtained by untrained judges (46%
vs 50%).



The most interesting difference between the trained and untrained
judges was in the types of errors made. Generally, untrained judges
tended to classify more interventions as significant. This resulted
in a higher percentage of accurately classified significant effects by
nearly 10% (50% for trained and 60% for untrained judges). However,
it appears that the judgments of significant effects made by untrained
judges were higher in general. Again, this is consistent with the
previous research by Tindal et al. (1983). The implication of this
finding is that a higher percentage of Type I errors is made by
untrained judges (judgments of an effect being made when in fact no
effect is present), using statistical analysis as the criterion.
Furthermore, the inherently low probability of Type I error in visual
analysis cannot be assumed, but rather occurs as a function of
training.
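
The percentages in this paragraph can be recomputed from the cell
counts in Table 2. The sketch below is illustrative only: the function
and variable names are hypothetical, the trained judges'
significant/nonsignificant cell is taken to be 135 (the value implied
by n = 700), and statistical analysis is treated as the criterion, as
in the text.

```python
# Cell counts from Table 2, keyed by (visual result, statistical result).
# Assumption: the trained judges' (sig, nonsig) cell is 135, implied by n = 700.
TABLE_2 = {
    "trained": {("sig", "sig"): 188, ("sig", "nonsig"): 135,
                ("nonsig", "sig"): 187, ("nonsig", "nonsig"): 190},
    "untrained": {("sig", "sig"): 233, ("sig", "nonsig"): 211,
                  ("nonsig", "sig"): 157, ("nonsig", "nonsig"): 127},
}


def error_rates(cells):
    """Return hit and false-alarm rates with statistics as the criterion."""
    stat_sig = cells[("sig", "sig")] + cells[("nonsig", "sig")]
    stat_nonsig = cells[("sig", "nonsig")] + cells[("nonsig", "nonsig")]
    return {
        # Share of statistically significant effects also judged significant.
        "hit_rate": cells[("sig", "sig")] / stat_sig,
        # Share of statistically nonsignificant effects judged significant
        # (a Type I error relative to the statistical criterion).
        "false_alarm_rate": cells[("sig", "nonsig")] / stat_nonsig,
    }


for group, cells in TABLE_2.items():
    rates = error_rates(cells)
    print(group, {name: round(100 * value) for name, value in rates.items()})
```

Under these assumptions the hit rates come out at about 50% and 60%,
matching the figures above, while the false-alarm rates are about 42%
for trained and 62% for untrained judges.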

In contrast to the findings for trained vs untrained judges, the
use of aimlines appeared not only to be ineffective, but actually to
interfere with judgments of effects. With aimlines present, more of
the statistically nonsignificant interventions were deemed effective.
That is, the aimline appeared to sway judgments of effects when none
were actually present. Although aimlines may simplify the
decision-making process, it is also possible that their use may
distort that judgment. The problem with the aimline/decision rule
system may reside in the fact that critical data (i.e., slope and
variability) essentially are ignored. This problem may be remediated
by the use of an aimline/decision rule similar to that developed by
Mirkin, Deno, Fuchs, Wesson, Tindal, Marston, and Kuehnle (1981), in
which program analysis is based on the degree to which the slope of
student improvement intersects with the aimline of the program.

The analysis of reliability of agreement appeared as predicted:
with low constant variability a significant relationship occurred,
while no such relationship occurred for high constant variability.
However, the percentage of misclassified effects was quite high for
both groups. Again, more judgments of significant effects using
visual analysis were reported when the variability was high.

In summary, it appears that visual analysis of time series data
shows very modest agreement with statistical analysis, and is
influenced in quite predictable ways by characteristics of the data.
The use of aimlines does not adequately solve the problems of visual
analysis and may well exacerbate them. Although training appears
necessary for establishing the statistical conclusion validity (Cook &
Campbell, 1979) of visual analysis, some type of decision rule system
also must be developed, given the high rates of misclassification of
effects, using statistical analysis as the criterion.

Table 1

Results of Visual and Statistical Analyses for All Graphs

            Trained Judges         Untrained Judges
Graph                   Not                     Not
Number   Effective   Effective   Effective   Effective      Statistic

1*           21           4          24           2         .80    3.31
2            11          14          15          11         .28    1.14
3*           15          10          22           4         .40    1.66
4*            6          19          13          13         .41    1.87
5*           25           0          26           0         .86    3.58
6             5          20           5          21         .17    0.71
7            14          11          17           9        -.09    [illegible]
8             6          19          12          14 [illegible]   -0.32
9            22           3          24           2         .17    0.69
10*           4          21          13          13         .82    3.40
11*           5          20           2          24         .53    2.19
12*          25           0          26           0         .81    3.35
13           22           3          26           0         .29    1.21
14*           1          24           5          21         .55    2.27
15*          24           1          21           5         .53    2.21
16            7          18          20           6         .28    1.16
17*          14          11          15          11         .45    1.87
18            4          21           7          19         .22    0.92
19*           2          23           6          20         .55    2.27
20           22           3          23           3         .22    0.91
21*           8          17          18           8         .72    2.89
22           15          10          26           0         .11    0.47
23            6          19          11          15         .13    0.56
24            1          24           9          17         .26    1.08
25            6          19           8          18         .27    1.13
26*           8          17          19           7         .66    2.75
27*          17           8          17           9         .64    2.66
28            7          18          14          12         .35    1.40

*Statistically significant at p ≤ .05.

Table 2

Relationship Between Results of Visual and Statistical Analyses for
Trained and Untrained Judges

                                   Statistical Analysis Results
               Visual Analysis
Judges         Results             Significant    Nonsignificant

Trained(a)     Significant             188             135
               Nonsignificant          187             190

Untrained(b)   Significant             233             211
               Nonsignificant          157             127

(a) n = 700; χ² = 4.84, p ≤ .05; % misclassified = 46.
(b) n = 728; χ² = .43, p ≤ .50; % misclassified = 53.
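
The footnote statistics can be checked directly from the four cell
counts. The following is a minimal sketch, not the authors' procedure:
the source does not say how χ² was computed, and the function names
are hypothetical. It assumes a 2 x 2 chi-square with Yates' continuity
correction, which happens to reproduce the reported value for trained
judges, and it takes the trained judges' significant/nonsignificant
cell to be 135 (the value implied by n = 700).

```python
def yates_chi_square(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] with Yates' correction."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2 (continuity correction), never below zero.
    adjusted = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * adjusted ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def percent_misclassified(a, b, c, d):
    """Percentage of judgments in the off-diagonal (disagreement) cells."""
    return 100 * (b + c) / (a + b + c + d)


# Trained judges: rows = visual result, columns = statistical result.
trained = (188, 135, 187, 190)
print(round(yates_chi_square(*trained), 2))    # 4.84
print(round(percent_misclassified(*trained)))  # 46
```

The same two functions can be applied to the cell counts of any of the
other 2 x 2 tables in this section.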



Table 3

Relationship Between Results of Visual and Statistical Analyses
for Graphs with Slopes of 10°, 15°, and 20°

                                   Statistical Analysis Results
               Visual Analysis
Slope          Results             Significant    Nonsignificant

10°(a)         Significant              77             101
               Nonsignificant          127             103

15°(b)         Significant             106              92
               Nonsignificant           98             112

20°(c)         Significant             219             105
               Nonsignificant           36              48

(a) n = 408; χ² = 5.27, p ≤ .03; % misclassified = 56.
(b) n = 408; χ² = 1.66, p ≤ .20; % misclassified = 47.
(c) n = 408; χ² = 16.37, p < .001; % misclassified = 35.


Table 4

Relationship Between Results of Visual and Statistical Analyses for
Graphs with and without Aimlines

                                         Statistical Analysis Results
                     Visual Analysis
Condition            Results             Significant    Nonsignificant

With Aimlines(a)     Significant             146             207
                     Nonsignificant          160              99

Without Aimlines(b)  Significant             195             125
                     Nonsignificant          111             181

(a) n = 612; χ² = 24.10, p ≤ .001; % misclassified = 60.
(b) n = 612; χ² = 31.18, p < .001; % misclassified = 39.


Table 5

Relationship Between Results of Visual and Statistical Analyses for
Four Variability Conditions

                                         Statistical Analysis Results
Variability          Visual Analysis
Condition            Results             Significant    Nonsignificant

Increasing(a)        Significant               0             152
                     Nonsignificant            0             152

Decreasing(b)        Significant             198               0
                     Nonsignificant          106               0

Constant Low(c)      Significant             129              19
                     Nonsignificant          126              32

Constant High(d)     Significant              37             145
                     Nonsignificant           14             110

(a) n = 304; χ² = 0, p ≤ 1.0; % misclassified = 50.
(b) n = 304; χ² = 27.87, p < .001; % misclassified = 35.
(c) n = 306; χ² = 2.52, p ≤ .13; % misclassified = 47.
(d) n = 306; χ² = 3.71, p ≤ .06; % misclassified = 52.


Table 6

Relationship Between Results of Visual and Statistical Analyses from
Time 1 to Time 2 for Graphs with Low Slopes (10°) and
Low or High Constant Variability

                                         Statistical Analysis Results
                     Visual Analysis
Variability          Results             Significant    Nonsignificant

Low Constant(a)      Significant              25              45
                     Nonsignificant           77              57

High Constant(b)     Significant              37              48
                     Nonsignificant           65              54

(a) n = 204; χ² = 7.85, p < .01; % misclassified = 60.
(b) n = 204; χ² = 2.02, p ≤ .16; % misclassified = 55.